IEEE/ACM Transactions on Computational Biology and Bioinformatics — Latest Matching Preprints

1

GRNIX: A Graph Neural Network Framework for Explainable Gene Regulatory Network Inference in Autoimmune Diseases Using XAI

Manai, M. M.

2024-11-25 immunology 10.1101/2024.11.24.625043 medRxiv

Top 0.1%

30.7%

Autoimmune diseases result from dysregulated immune mechanisms influenced by complex gene regulatory networks (GRNs). Deciphering these networks has significant implications for understanding disease mechanisms, predicting disease progression, and identifying novel therapeutic targets. Traditional GRN inference techniques rely on statistical correlations or deterministic models, which are limited in capturing nonlinear interactions and often fail to provide interpretable outputs. Machine learning (ML)-based approaches, while more powerful, typically function as black-box systems, impeding their adoption in clinical settings. To bridge this gap, we introduce GRNIX, a GRN inference framework designed to balance predictive accuracy with explainability. The framework integrates multi-omics data, incorporates biological and structural priors, and applies explainable artificial intelligence (XAI) techniques to enhance interpretability.

2

Drug-disease networks and drug repurposing

Polanco, A.; Newman, M.

2025-02-06 bioinformatics 10.1101/2025.01.31.634767 medRxiv

Top 0.1%

18.9%

Repurposing existing drugs to treat new diseases is a cost-effective alternative to de novo drug development, but there are millions of potential drug-disease combinations to be considered with only a small fraction being viable. In silico predictions of drug-disease associations can be invaluable for reducing the size of the search space. In this work we present a novel network of drugs and the diseases they treat, compiled using a combination of existing machine-readable and textual databases, natural-language processing tools, and hand curation, and analyze it using a selection of network-based link prediction methods to identify potential drug-disease combinations. We measure the efficacy of these methods using cross-validation tests and find that several methods, particularly those based on graph embedding and network model fitting, achieve impressive prediction performance, with area under the ROC curve above 0.95 and average precision almost a thousand times better than chance.
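
The two metrics quoted in this abstract can be illustrated with a small sketch. The scores and labels below are invented toy values, not data from the paper, and the function names are hypothetical. The point is that the chance level for average precision equals the fraction of true pairs, which is why a heavily imbalanced drug-disease problem can report a very large "times better than chance" figure.

```python
def roc_auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Mean of the precision values measured at the rank of each true positive."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits  # assumes at least one positive label


# Toy predictions for six candidate drug-disease pairs (illustrative only).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 0, 1, 0, 0, 0]

auc = roc_auc(scores, labels)
ap = average_precision(scores, labels)
chance_ap = sum(labels) / len(labels)  # a random ranking scores about the prevalence
lift = ap / chance_ap                  # "times better than chance"
```

With a prevalence of one true pair in many thousands of candidates, the same calculation yields lifts in the hundreds or thousands even for moderate average precision.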

3

Machine learning fairness analysis on clinical data of the Emory Breast imaging dataset (EMBED)

Ramachandra, V.

2023-07-27 radiology and imaging 10.1101/2023.07.23.23293043 medRxiv

Top 0.1%

18.6%

This paper explores the use of machine learning (ML) for predictive modeling on clinical data from the EMory BrEast imaging Dataset (EMBED) [1] with a focus on ML model fairness analysis. The aim of this study is to develop and evaluate fair machine learning models that can accurately predict breast cancer risk. We trained and tested various machine-learning models. Our findings show that machine learning can be effective for predicting breast cancer risk or diagnosing breast cancer, and that fairness considerations are crucial in the development of such models. Overall, our study highlights the potential of machine learning for clinical applications while emphasizing the need for ethical and fair practices in this field.

4

ASTER: A Method to Predict Clinically Actionable Synthetic Lethal Interactions

Liany, H.; Jeyasekharan, A.; Rajan, V.

2020-10-28 bioinformatics 10.1101/2020.10.27.356717 medRxiv

Top 0.1%

18.4%

A Synthetic Lethal (SL) interaction is a functional relationship between two genes or functional entities where the loss of either entity is viable but the loss of both is lethal. Such pairs can be used to develop targeted anticancer therapies with fewer side effects and reduced overtreatment. However, finding clinically actionable SL interactions remains challenging. Leveraging unified gene expression data of both disease-free and cancerous samples, we design a new technique based on statistical hypothesis testing, called ASTER, to identify SL pairs. We empirically find that the patterns of mutual exclusivity that ASTER finds using genomic and transcriptomic data provide a strong signal of SL. For large-scale multiple hypothesis testing, we develop an extension called ASTER++ that can utilize additional input gene features within the hypothesis testing framework. Our extensive experiments demonstrate the efficacy of ASTER in identifying SL pairs with potential therapeutic benefits. CCS Concepts: Applied computing → Computational genomics; Health informatics. Mathematics of computing → Hypothesis testing and confidence interval computation.
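
The abstract does not specify which hypothesis test ASTER uses, so the following is only a generic illustration of testing for mutual exclusivity, not the authors' method. It assumes binary alteration calls per sample and applies a one-sided hypergeometric (Fisher-type) test for fewer co-occurrences than independence would predict. The function name and the counts are hypothetical.

```python
from math import comb


def mutual_exclusivity_pvalue(n_total, n_a, n_b, n_both):
    """One-sided p-value: P(co-occurrences <= n_both) if alterations were independent.

    n_total: number of samples; n_a, n_b: samples with gene A / gene B altered;
    n_both: samples with both altered. Small values suggest mutual exclusivity.
    """
    lowest = max(0, n_a + n_b - n_total)  # smallest overlap that is possible at all
    favourable = sum(
        comb(n_a, k) * comb(n_total - n_a, n_b - k)
        for k in range(lowest, n_both + 1)
    )
    return favourable / comb(n_total, n_b)


# Toy cohort: 20 samples, each gene altered in 8 of them, never together.
p_exclusive = mutual_exclusivity_pvalue(20, 8, 8, 0)
# Same marginals, but the alterations always coincide: no evidence of exclusivity.
p_overlapping = mutual_exclusivity_pvalue(20, 8, 8, 8)
```

When many gene pairs are screened this way, the resulting p-values need a multiple-testing correction, which is the setting the abstract says ASTER++ addresses.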

5

DruID: Personalized Drug Recommendations by Integrating Multiple Biomedical Databases for Cancer

Liany, H.; Jeyasekharan, A.; Rajan, V.

2021-04-11 bioinformatics 10.1101/2021.04.11.439315 medRxiv

Top 0.1%

18.4%

Advances in next-generation sequencing technologies have led to the development of personalized genomic profiles in diagnostic panels that inform oncologists of alterations in clinically relevant genes. While targeted therapies for some alterations may be found, an effective therapeutic strategy should consider multiple and dependent genetic interactions that affect cancer progression, a task which remains challenging. There are ongoing efforts to profile cancer cells in vitro, both to catalog their genomic information and study their sensitivity to various drugs. There is a need for tools that can interpret the personalized genomic profile of a patient in light of information from these biological and pre-clinical studies and recommend potentially useful drugs. To address this need, we develop a new algorithmic framework called DruID, to effectively combine drug efficacy predictions from a deep neural network model with information, such as drug sensitivity, drug-drug interactions and genetic dependencies, from multiple publicly available databases. We empirically evaluate DruID on cancer cell line data on which the efficacy of many drugs has been experimentally determined. We find that DruID outperforms competing approaches and promises to be a useful tool in clinical decision-making.

6

DTPPI: predicting drug interactions using a weighted drug-protein network

Szydlik, S.; Taheri, G.

2025-01-08 systems biology 10.1101/2025.01.06.631638 medRxiv

Top 0.1%

18.3%

Polypharmacy, the practice of using multiple drugs to treat complex diseases, poses a significant risk of drug-drug interactions (DDIs), which can lead to unanticipated adverse drug reactions (ADRs) and toxicity. Identifying and understanding these DDIs is crucial to ensuring the safety of polypharmacy. Traditional laboratory-based methods for detecting DDIs are costly and time-consuming, prompting the development of computational approaches. However, many of these methods face limitations, mainly the lack of utilization of biological networks to model drug mechanics. Such an approach could lead to a new technique with better and more accurate DDI predictions. In response to these challenges, we propose the DTPPI network, a novel machine learning approach that leverages a drug-target-protein-protein interaction network to improve DDI prediction. By extracting topological features and combining them with biological drug features, the DTPPI method enhances the performance of a multilayer perceptron model. The evaluation results showed an AUC of 0.64 for topological characteristics alone, 0.89 for biological characteristics, and 0.91 for combined features, demonstrating that integrating topological and biological data significantly improves the prediction accuracy of DDI. Materials and implementations are available at: https://github.com/Golnazthr/DTPPI Highlights: Constructs a weighted graph network to model interactions among drugs, proteins, and targets, applicable to all drug types. Extracts six universal topological features from the graph, independent of chemical structure. Enhances DDI prediction by using topological features alone or alongside traditional drug features. Incorporates an MLP model for superior predictive accuracy using combined features.

7

Identification of targets for drug repurposing to treat COVID-19 using a Deep Learning Neural Network

Wang, S. H.; Tang, Y.-H.; Hsu, H.; Yu, C.-N.; Lee, O. K.-S.

2023-05-29 genetic and genomic medicine 10.1101/2023.05.23.23290403 medRxiv

Top 0.1%

18.2%

The COVID-19 pandemic has resulted in a global public health crisis requiring immediate therapeutic solutions. To address this challenge, we developed a deep learning model using the graph-embedding convolution network (GECN) algorithm. Our approach identified COVID-19-related genes and potential druggable targets, including tyrosine kinase ABL1/2, pro-inflammatory cytokine CSF2, and pro-fibrotic cytokines IL-4 and IL-13. These target genes are implicated in critical processes related to COVID-19 pathogenesis, including endosomal membrane fusion, cytokine storm, and tissue fibrosis. Our analysis revealed that ABL kinase inhibitors, lenzilumab (anti-CSF2), and dupilumab (anti-IL4R) represent promising therapeutic solutions that can effectively block virus-host membrane fusion or attenuate hyperinflammation in COVID-19 patients. Compared to the traditional drug screening process, our GECN algorithm enables rapid analysis of disease-related human protein interaction networks and prediction of candidate drug targets from a large-scale knowledge graph in a cost-effective and efficient manner. Overall, our results suggest that the model has the potential to facilitate drug repurposing and aid in the fight against COVID-19.

8

A consistent evaluation of miRNA-disease association prediction models

Dong, T. N. N.; Khosla, M.

2020-06-13 bioinformatics 10.1101/2020.05.04.075754 medRxiv

Top 0.1%

18.2%

Motivation: A variety of machine-learning-based approaches have been applied to predicting miRNA-disease association. Although promising, the evaluation setup used to measure prediction performance is inconsistent, making it difficult to assess the actual progress. A more acute problem is that most of the models overlook the problem of data leakage due to the use of precomputed miRNA and disease similarity features. Results: We unearth a crucial problem of data leakage in the evaluation of machine learning models for miRNA-disease association prediction. In particular, information from the test set, in the form of precomputed input features for miRNA and disease, is used during training of the model. Moreover, we point out problems in the widely used performance metrics used in model evaluation. While resolving the issues of data leakage and model evaluation, we perform an in-depth study of 3 recent models along with our proposed 9 variants of these models. Our proposed variants have resulted in improvements in Average Precision scores (as compared to original models) by approximately 287.7% and 36.7% on HMDDv2.0 (AP: 0.504) and HMDDv3.0 (AP: 0.216) datasets respectively. Availability and Implementation: We release a unified evaluation framework including all models and datasets at https://git.l3s.uni-hannover.de/dong/simplifying_mirna_disease.
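
The leakage described here can be made concrete with a toy sketch. It assumes, purely for illustration, that miRNA similarity is a Jaccard overlap of known disease associations; the abstract does not say which similarity measures the evaluated models use, and all names below are hypothetical. If the similarity matrix is computed before the held-out pairs are hidden, the test labels are already encoded in the input features.

```python
def jaccard(a, b):
    """Jaccard overlap between two binary association profiles."""
    inter = sum(x and y for x, y in zip(a, b))
    union = sum(x or y for x, y in zip(a, b))
    return inter / union if union else 0.0


def row_similarity(assoc):
    """Pairwise similarity between the rows (miRNAs) of an association matrix."""
    return [[jaccard(r1, r2) for r2 in assoc] for r1 in assoc]


def mask_test_pairs(assoc, test_pairs):
    """Copy of the matrix with held-out (miRNA, disease) entries set to 0."""
    masked = [row[:] for row in assoc]
    for i, j in test_pairs:
        masked[i][j] = 0
    return masked


# Toy miRNA x disease matrix; the pair (miRNA 1, disease 1) is held out for testing.
assoc = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
]
test_pairs = [(1, 1)]

leaky = row_similarity(assoc)                               # computed on all data
clean = row_similarity(mask_test_pairs(assoc, test_pairs))  # computed on training data only
```

In the leaky version miRNA 0 and miRNA 1 look identical, which hands the model the held-out association; after masking, their similarity drops to 0.5.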

9

GraMDTA: Multimodal Graph Neural Networks for Predicting Drug-Target Associations

Yella, J. K.; Ghandikota, S. K.; Jegga, A. G.

2022-09-02 bioinformatics 10.1101/2022.08.30.505168 medRxiv

Top 0.1%

18.0%

Finding novel drug-target associations is vital for drug discovery. However, screening millions of small molecules for a select target protein is challenging. Several computational approaches developed in the past use machine learning methods for drug-target association (DTA) prediction and predominantly rely on structural data of drugs and proteins. Some of these approaches use knowledge graph networks and link prediction. To the best of our knowledge, there have been no approaches that use both structural learning, which offers molecular-based representations, and knowledge graph-based learning, which offers interaction-based representations, for DTA discovery. Based on the premise that multimodal sources of information acting complementarily could improve the robustness of DTA predictions, we developed GraMDTA, a multimodal graph neural network that learns both structural and knowledge graph representations, utilizing multi-head attention to fuse the multimodal representations. We compare GraMDTA with other computational approaches for DTA prediction to demonstrate the power of multimodal fusion for discovery of DTA.

10

Graphical Learning and Causal Inference for Drug Repurposing

Xu, T.; Zhao, J.; Xiong, M.

2023-08-02 genetic and genomic medicine 10.1101/2023.07.29.23293346 medRxiv

Top 0.1%

15.5%

Gene expression profiles that connect drug perturbations, disease gene expression signatures, and clinical data are important for discovering potential drug repurposing indications. However, the current approach to gene expression reversal has several limitations. First, most methods focus on validating the reversal expression of individual genes. Second, there is a lack of causal approaches for identifying drug repurposing candidates. Third, few methods for passing and summarizing information on a graph have been used for drug repurposing analysis, with classical network propagation and gene set enrichment analysis being the most common. Fourth, there is a lack of graph-valued association analysis, with current approaches using real-valued association analysis one gene at a time to reverse abnormal gene expressions to normal gene expressions. To overcome these limitations, we propose a novel causal inference and graph neural network (GNN)-based framework for identifying drug repurposing candidates. We formulated a causal network as a continuous constrained optimization problem and developed a new algorithm for reconstructing large-scale causal networks of up to 1,000 nodes. We conducted large-scale simulations that demonstrated good false positive and false negative rates. To aggregate and summarize information on both nodes and structure from the spatial domain of the causal network, we used directed acyclic graph neural networks (DAGNN). We also developed a new method for graph regression in which both dependent and independent variables are graphs. We used graph regression to measure the degree to which drugs reverse altered gene expressions of disease to normal levels and to select potential drug repurposing candidates. 
To illustrate the application of our proposed methods for drug repurposing, we applied them to phase I and II L1000 connectivity map perturbational profiles from the Broad Institute LINCS, which consist of gene-expression profiles for thousands of perturbagens at a variety of time points, doses, and cell lines, as well as disease gene expression data under-expressed and over-expressed in response to SARS-CoV-2.

11

Network-based clustering for drug sensitivity prediction in cancer cell lines

Pouryahya, M.; Oh, J. H.; Mathews, J. C.; Belkhatir, Z.; Moosmuller, C.; Deasy, J. O.; Tannenbaum, A. R.

2019-09-18 bioinformatics 10.1101/764043 medRxiv

Top 0.1%

15.2%

The study of large-scale pharmacogenomics provides an unprecedented opportunity to develop computational models that can accurately predict large cohorts of cell lines and drugs. In this work, we present a novel method for predicting drug sensitivity in cancer cell lines which considers both cell line genomic features and drug chemical features. Our network-based approach combines the theory of optimal mass transport (OMT) with machine learning techniques. It starts with unsupervised clustering of both cell line and drug data, followed by the prediction of drug sensitivity in the paired cluster of cell lines and drugs. We show that prior clustering of the heterogeneous cell lines and structurally diverse drugs significantly improves the accuracy of the prediction. In addition, it facilitates the interpretability of the results and the identification of molecular biomarkers which are significant for both clustering of the cell lines and predicting the drug response.

12

Machine Learning for Predicting Therapeutic Outcomes in Acute Myeloid Leukemia Patients

Karathanasis, N.; Papasavva, P.; Oulas, A.; Spyrou, G. M.

2024-03-02 genetic and genomic medicine 10.1101/2024.02.29.24303536 medRxiv

Top 0.1%

15.1%

Background and Objective: The standard of care in Acute Myeloid Leukemia patients has remained essentially unchanged for nearly 40 years. Due to the complicated mutational patterns within and between individual patients and a lack of targeted agents for most mutational events, implementing individualized treatment for AML has proven difficult. We reanalysed the BeatAML dataset employing Machine Learning algorithms. The BeatAML project entails patients extensively characterized at the molecular and clinical levels and linked to drug sensitivity outputs. Our approach capitalizes on the molecular and clinical data provided by the BeatAML dataset to predict the ex vivo drug sensitivity for the 122 drugs evaluated by the project. Methods: We utilized ElasticNet, which produces fully interpretable models, in combination with a two-step training protocol that allowed us to narrow down computations. We automated the gene-filtering step by employing two metrics, and we evaluated all possible data combinations to identify the best training configuration settings per drug. Results: We report a Pearson correlation across all drugs of 0.36 when clinical and RNA sequencing data were combined, with the best-performing models reaching a Pearson correlation of 0.67. When we trained using the datasets in isolation, we noted that RNA sequencing data (Pearson: 0.36) attained three times the predictive power of whole exome sequencing data (Pearson: 0.11), with clinical data falling somewhere in between (Pearson: 0.26). Lastly, we present a paradigm of clinical significance. We used our models' predictions as a health management score to rank an individual's expected response to treatment. For 78 of 89 patients (88%), the proposed drug was more potent than the administered one based on their ex vivo drug sensitivity data.
Conclusions: In conclusion, our reanalysis of the BeatAML dataset using Machine Learning algorithms demonstrates the potential for individualized treatment prediction in Acute Myeloid Leukemia patients, addressing the longstanding challenge of treatment personalization in this disease. By leveraging molecular and clinical data, our approach yields promising correlations between predicted drug sensitivity and actual responses, highlighting a significant step forward in improving therapeutic outcomes for AML patients. Highlights: Machine learning can predict response to treatment in Acute Myeloid Leukemia patients. RNA sequencing data are more informative than whole exome sequencing and clinical data in predicting drug response in Acute Myeloid Leukemia patients. Drug response predictions could be used as a health management score to rank an individual's expected response to treatment. We identified a more potent drug than the administered one for 88% (78 out of 89) of the patients examined.

13

A Non-Negative Matrix Tri-Factorization based Method for Predicting Antitumor Drug Sensitivity

Pido, S.; Testa, C.; Pinoli, P.

2021-12-06 bioinformatics 10.1101/2021.12.03.471100 medRxiv

Top 0.1%

15.0%

Large annotated cell line collections have been proven to enable the prediction of drug response in the preclinical setting. We present an enhancement of the Non-Negative Matrix Tri-Factorization method, which allows the integration of different data types for the prediction of missing associations. To test our method, we retrieved a dataset from CCLE containing the connections among cell lines and drugs by means of their IC50 values. We performed two different kinds of experiments: a) prediction of missing values in the matrix, b) prediction of the complete drug profile of a new cell line, demonstrating the validity of the method in both scenarios.

14

Drug-Target Interaction prediction using Multi-Graph Regularized Deep Matrix Factorization

Mongia, A.; Majumdar, A.

2019-09-19 bioinformatics 10.1101/774539 medRxiv

Top 0.1%

15.0%

Drug discovery is an important field in the pharmaceutical industry, with one of its crucial chemogenomic processes being drug-target interaction prediction. Determining these interactions experimentally is expensive and laborious, which creates the need for alternative computational approaches that could help reduce the search space for biological experiments. This paper proposes a novel framework for drug-target interaction (DTI) prediction: Multi-Graph Regularized Deep Matrix Factorization (MGRDMF). The proposed method, motivated by the success of deep learning, finds a low-rank solution which is structured by the proximities of drugs and targets (drug similarities and target similarities) using deep matrix factorization. Deep matrix factorization is capable of learning deep representations of drugs and targets for interaction prediction. It is an established fact that incorporating drug and target similarities preserves the local geometries of the data in the original space and learns the data manifold better. However, there is no literature on which type of similarity matrix (apart from the standard biological chemical structure similarity for drugs and genomic sequence similarity for targets) could best help in DTI prediction. Therefore, we attempt to take into account various types of similarities between drugs/targets as multiple graph Laplacian regularization terms which take into account the neighborhood information between drugs/targets. This is the first work which has leveraged multiple similarity/neighborhood information in the deep learning framework for drug-target interaction prediction. The cross-validation results on four benchmark data sets validate the efficacy of the proposed algorithm by outperforming shallow state-of-the-art computational methods on the grounds of AUPR and AUC.
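
The abstract does not give the MGRDMF objective, so the sketch below shows only the generic graph Laplacian regularization term that such methods build on, under the assumption of a symmetric similarity matrix W and latent factors U (one row per drug). The term tr(Uᵀ L U) with L = D − W equals half the similarity-weighted sum of squared distances between latent vectors, so minimizing it pulls similar drugs together. All names and numbers are hypothetical.

```python
def laplacian(W):
    """Graph Laplacian L = D - W for a symmetric similarity matrix W."""
    n = len(W)
    return [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]


def laplacian_penalty(U, W):
    """Regularization term tr(U^T L U)."""
    L = laplacian(W)
    n, k = len(U), len(U[0])
    return sum(U[i][c] * L[i][j] * U[j][c]
               for i in range(n) for j in range(n) for c in range(k))


def pairwise_penalty(U, W):
    """Equivalent form: 0.5 * sum_ij W_ij * ||u_i - u_j||^2."""
    n = len(U)
    return 0.5 * sum(
        W[i][j] * sum((a - b) ** 2 for a, b in zip(U[i], U[j]))
        for i in range(n) for j in range(n)
    )


# Toy drug-drug similarity graph and 2-dimensional latent factors (illustrative only).
W = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 0.5],
     [0.0, 0.5, 0.0]]
U = [[1.0, 0.0],
     [0.8, 0.1],
     [0.0, 1.0]]

trace_form = laplacian_penalty(U, W)
distance_form = pairwise_penalty(U, W)
```

Using several similarity types, as the abstract describes, would amount to adding one such term per similarity graph, each with its own weight.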

15

LinkDTI: Drug-Target Interactions prediction through a Link Prediction framework on Biomedical Knowledge Graph

Mondal, M.; Arunachalam, S.; Wu, S.; Datta, A.

2026-02-23 bioinformatics 10.64898/2026.02.21.707210 medRxiv

Top 0.1%

14.9%

Computational drug-target interaction (DTI) prediction serves as a valuable tool for drug discovery and repurposing by cost-effectively narrowing down the potential drug-target space. This paper presents LinkDTI, a computational framework that predicts DTIs by identifying connections within a heterogeneous knowledge graph of drugs, proteins, diseases, and side effects. Unlike methods that rely on mathematical techniques like matrix completion or similarity-based scoring, LinkDTI uses an advanced graph-based approach to capture relationships between biomedical entities. Specifically, LinkDTI applies a modified version of the multilayer GraphSAGE model that learns from the heterogeneous knowledge graph and predicts potential drug-target interactions. Our model incorporates negative sampling that balances the data to address the issue of having more negative than positive interactions. Our results show that LinkDTI consistently performs better in AUROC and AUPRC than baseline methods by at least 2.5% across different sampling ratios and conditions. Subsequently, it identifies approximately 945 new potential DTIs, marking a 49.14% increase over known DTIs. Overall, LinkDTI offers a simple yet effective method for integrating diverse biomedical data to identify potential drug-target interactions. The code and data can be found at https://github.com/hub2nature/LinkDTI_heterogenous_KG.git.
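
The negative sampling step mentioned in this abstract can be sketched as follows. This is a generic illustration rather than the LinkDTI implementation: it assumes drugs and targets are indexed by integers, treats every pair without a known interaction as a candidate negative, and draws negatives at a chosen ratio to the positives. The function name, the `ratio` parameter, and the toy pairs are hypothetical.

```python
import random


def sample_negatives(n_drugs, n_targets, positives, ratio=1, seed=0):
    """Draw ratio * len(positives) drug-target pairs with no known interaction.

    Unlabeled pairs are assumed to be negatives, which is a simplification:
    some of them may be true interactions that have not been discovered yet.
    """
    rng = random.Random(seed)
    known = set(positives)
    candidates = [(d, t) for d in range(n_drugs) for t in range(n_targets)
                  if (d, t) not in known]
    k = min(len(candidates), ratio * len(known))
    return rng.sample(candidates, k)


# Toy example: 4 drugs, 5 targets, 3 known interactions.
positives = [(0, 1), (2, 3), (3, 0)]
balanced = sample_negatives(4, 5, positives, ratio=1)  # 1:1 positives to negatives
skewed = sample_negatives(4, 5, positives, ratio=3)    # 1:3 positives to negatives
```

Varying `ratio` corresponds to the "different sampling ratios" over which the abstract reports results.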

16

TranDTA: Prediction Of Drug Target Binding Affinity Using Transformer Representations

Saadat, M.; Behjati, A.; Zare-Mirakabad, F.; Gharaghani, S.

2021-10-02 bioinformatics 10.1101/2021.09.30.462610 medRxiv

Top 0.1%

13.2%

Drug discovery is generally difficult and expensive, with a low success rate. One of the essential steps in the early stages of drug discovery and drug repurposing is identifying drug-target interactions. Binding affinity indicates the strength of drug-target pair interactions. In this regard, several computational methods have been developed to predict the drug-target binding affinity, and the input representation of these models has been shown to be very effective in improving accuracy. Although the recent models predict binding affinity more accurately than the first ones, they need the structure of target proteins. Despite the strong interest in protein structure, there is a massive gap between known sequences and experimentally determined structures. Therefore, finding an appropriate representation for drug and protein sequences is vital for drug-target binding affinity prediction. In this paper, our primary goal is to assess the drug and protein sequence representation for improving drug-target binding affinity prediction.

17

Unveiling the Robustness of Machine Learning Models in Classifying COVID-19 Spike Sequences

Ali, S.; Chen, P.-Y.; Patterson, M.

2023-08-24 bioinformatics 10.1101/2023.08.24.554651 medRxiv

Top 0.1%

13.0%

In the midst of the global COVID-19 pandemic, a wealth of data has become available to researchers, presenting a unique opportunity to investigate the behavior of the virus. This research aims to facilitate the design of efficient vaccinations and proactive measures to prevent future pandemics through the utilization of machine learning (ML) models for decision-making processes. Consequently, ensuring the reliability of ML predictions in these critical and rapidly evolving scenarios is of utmost importance. Notably, studies focusing on the genomic sequences of individuals infected with the coronavirus have revealed that the majority of variations occur within a specific region known as the spike (or S) protein. Previous research has explored the analysis of spike proteins using various ML techniques, including classification and clustering of variants. However, it is imperative to acknowledge the possibility of errors in spike proteins, which could lead to misleading outcomes and misguide decision-making authorities. Hence, a comprehensive examination of the robustness of ML and deep learning models in classifying spike sequences is essential. In this paper, we propose a framework for evaluating and benchmarking the robustness of diverse ML methods in spike sequence classification. Through extensive evaluation of a wide range of ML algorithms, ranging from classical methods like naive Bayes and logistic regression to advanced approaches such as deep neural networks, our research demonstrates that utilizing k-mers for creating the feature vector representation of spike proteins is more effective than traditional one-hot encoding-based embedding methods. Additionally, our findings indicate that deep neural networks exhibit superior accuracy and robustness compared to non-deep-learning baselines. 
To the best of our knowledge, this study is the first to benchmark the accuracy and robustness of machine-learning classification models against various types of random corruptions in COVID-19 spike protein sequences. The benchmarking framework established in this research holds the potential to assist future researchers in gaining a deeper understanding of the behavior of the coronavirus, enabling the implementation of proactive measures and the prevention of similar pandemics in the future.
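
A k-mer feature vector of the kind compared against one-hot encoding here can be sketched in a few lines. The value k = 3, the 20-letter amino acid alphabet, and the function name are assumptions made for illustration; the abstract does not state the settings used. Unlike one-hot encoding, the resulting vector has the same length for every sequence, so no alignment or padding is needed.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_vector(seq, k=3, alphabet=AMINO_ACIDS):
    """Count every overlapping k-mer of seq into a fixed-length vector.

    The vector has len(alphabet) ** k entries. Windows containing characters
    outside the alphabet (e.g. 'X' for an unknown residue) are skipped.
    """
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    vec = [0] * len(index)
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:
            vec[j] += 1
    return vec


# Short toy fragment used only to exercise the function.
fragment = "MFVFLVLLPLVSSQ"
features = kmer_vector(fragment, k=3)
```

A single substituted residue changes at most k entries of the count vector, which gives some intuition for why such features might tolerate random sequence errors.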

18

BRDKRM: An Explainable Framework for Disease Modifying Drug Identification

Chandra, A.; Dey, A.; Chakraborty, M.; Maulik, U. B.; Bandyopadhyay, S.

2024-09-26 bioinformatics 10.1101/2024.09.24.614653 medRxiv

Top 0.1%

12.9%

Drug classification into disease-modifying (DM) and symptomatic (SYM) categories is crucial for clinical decision-making and therapeutic strategy development. To address the limitations of current methods, which often lack transparency and interpretability, we propose the Boundary Restricted Dynamic Key Route Mapping (BRDKRM) framework. This novel approach leverages the contextual overlap between disease and drug nodes in a heterogeneous graph, aggregating genes from the top K shortest paths to delineate disease neighborhood boundaries. Inspired by the classic Hansel and Gretel folklore, BRDKRM metaphorically marks boundary nodes along metapaths from disease to drug, akin to Hansel's breadcrumbs, which are then used to classify the therapeutic effect of candidate drugs. Our method achieved 86.78% accuracy in categorizing drug-disease treatments and identified 530 genes involved in both disease modification and symptomatic relief. The efficacy of BRDKRM is demonstrated through case studies on multiple sclerosis, offering an explainable approach to drug classification that bypasses extensive clinical trials. By providing biologically sound interpretations of drug classifications, our framework enhances understanding of therapeutic interventions, paving the way for more precise and efficient healthcare solutions while offering a novel approach to mapping disease-drug interactions.

19

GraphGR: A graph neural network to predict the effect of pharmacotherapy on the cancer cell growth

Singha, M.; Pu, L.; Shawky, A.-E.-M.; Busch, K.; Wu, H.-C.; Ramanujam, J.; Brylinski, M.

2020-05-23 bioinformatics 10.1101/2020.05.20.107458 medRxiv

Top 0.1%

12.8%

Genomic profiles of cancer cells provide valuable information on genetic alterations in cancer. Several recent studies employed these data to predict the response of cancer cell lines to treatment with drugs. Nonetheless, due to the multifactorial phenotypes and intricate mechanisms of cancer, the accurate prediction of the effect of pharmacotherapy on a specific cell line based on the genetic information alone is problematic. High prediction accuracies reported in the literature likely result from significant overlaps among training, validation, and testing sets, making many predictors inapplicable to new data. To address these issues, we developed GraphGR, a graph neural network with sophisticated attention propagation mechanisms to predict the therapeutic effects of kinase inhibitors across various tumors. Emphasizing the system-level complexity of cancer, GraphGR integrates multiple heterogeneous data, such as biological networks, genomics, inhibitor profiling, and gene-disease associations, into a unified graph structure. In order to construct diverse and information-rich cancer-specific networks, we devised a novel graph reduction protocol based not only on the topological information but also on the biological knowledge. The performance of GraphGR, properly cross-validated at the tissue level, is 0.83 in terms of the area under the receiver operating characteristic curve, which is notably higher than those measured for other approaches on the same data. Finally, several new predictions are validated against the biomedical literature, demonstrating that GraphGR generalizes well to unseen data, i.e., it can predict therapeutic effects across a variety of cancer cell lines and inhibitors. GraphGR is freely available to the academic community at https://github.com/pulimeng/GraphGR.

20

Are under-studied proteins under-represented? How to fairly evaluate link prediction algorithms in network biology

Yilmaz, S.; Yorgancioglu, K.; Koyuturk, M.

2022-10-17 systems biology 10.1101/2022.10.13.511953 medRxiv

Top 0.1%

12.8%

For biomedical applications, new link prediction algorithms are continuously being developed and these algorithms are typically evaluated computationally, using test sets generated by sampling the edges uniformly at random. However, as we demonstrate, this evaluation approach introduces a bias towards "rich nodes", i.e., those with higher degrees in the network. More concerningly, this bias persists even when different network snapshots are used for evaluation, as recommended in the machine learning community. This creates a cycle in research where newly developed algorithms generate more knowledge on well-studied biological entities while under-studied entities are commonly overlooked. To overcome this issue, we propose a weighted validation setting specifically focusing on under-studied entities and present AWARE strategies to facilitate bias-aware training and evaluation of link prediction algorithms. These strategies can help researchers gain better insights from computational evaluations and promote the development of new algorithms focusing on novel findings and under-studied proteins. Teaser: Systematically characterizes and mitigates bias toward well-studied proteins in the evaluation pipeline for machine learning. Code and data availability: All materials (code and data) to reproduce the analyses and figures in the paper are available in figshare (doi:10.6084/m9.figshare.21330429). The code for the evaluation framework implementing the proposed strategies is available at GitHub†. We provide a web tool‡ to assess the bias in benchmarking data and to generate bias-adjusted test sets.
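
The bias toward "rich nodes" follows directly from how uniform edge sampling works, as this small sketch illustrates. It is not the paper's AWARE method, only the underlying arithmetic: when test edges are drawn uniformly at random, each node's expected share of test-edge endpoints is its degree divided by twice the number of edges. The toy network and the function name are hypothetical.

```python
from collections import Counter


def endpoint_share(edges):
    """Expected share of test-edge endpoints per node under uniform edge sampling.

    Every edge is equally likely to be held out, so a node appears as an
    endpoint in proportion to its degree: share = degree / (2 * number of edges).
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    total = 2 * len(edges)
    return {node: d / total for node, d in degree.items()}


# Toy network: one well-studied hub protein and a few under-studied ones.
edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d"), ("e", "f")]
share = endpoint_share(edges)
```

Here the hub accounts for 40% of all test-edge endpoints while each under-studied protein accounts for 10%, so a method that simply favours high-degree nodes can score well without predicting anything new about poorly characterized proteins.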